tutorial_regression2 (Score: 32.0 / 34.0)

  1. Test cell (Score: 1.0 / 1.0)
  2. Test cell (Score: 1.0 / 1.0)
  3. Test cell (Score: 1.0 / 1.0)
  4. Test cell (Score: 1.0 / 1.0)
  5. Test cell (Score: 1.0 / 1.0)
  6. Test cell (Score: 1.0 / 1.0)
  7. Test cell (Score: 1.0 / 1.0)
  8. Test cell (Score: 1.0 / 1.0)
  9. Test cell (Score: 1.0 / 1.0)
  10. Coding free-response (Score: 3.0 / 3.0)
  11. Written response (Score: 3.0 / 3.0)
  12. Test cell (Score: 3.0 / 3.0)
  13. Test cell (Score: 3.0 / 3.0)
  14. Test cell (Score: 0.0 / 1.0)
  15. Test cell (Score: 3.0 / 3.0)
  16. Test cell (Score: 0.0 / 1.0)
  17. Written response (Score: 3.0 / 3.0)
  18. Test cell (Score: 1.0 / 1.0)
  19. Test cell (Score: 3.0 / 3.0)
  20. Test cell (Score: 1.0 / 1.0)

Tutorial 9: Regression Continued

Lecture and Tutorial Learning Goals:

By the end of the week, you will be able to:

  • Recognize situations where a simple regression analysis would be appropriate for making predictions.
  • Explain the $k$-nearest neighbour ($k$-nn) regression algorithm and describe how it differs from $k$-nn classification.
  • Interpret the output of a $k$-nn regression.
  • In a dataset with two variables, perform $k$-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
  • Execute cross-validation in R to choose the number of neighbours.
  • Using R, evaluate $k$-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error).
  • In a dataset with > 2 variables, perform $k$-nn regression in R using tidymodels to predict the values for a test dataset.
  • In the context of $k$-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
  • Describe advantages and disadvantages of the $k$-nearest neighbour regression approach.
  • Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
  • Compare and contrast predictions obtained from $k$-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

This tutorial covers parts of the Regression II chapter of the online textbook. You should read this chapter before attempting the worksheet.

In [1]:
### Run this cell before continuing.
library(tidyverse)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
options(repr.matrix.max.rows = 6)
source("tests.R")
source("cleanup.R")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.3      readr     2.1.4
 forcats   1.0.0      stringr   1.5.0
 ggplot2   3.4.4      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.0
 purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

 broom        1.0.5      rsample      1.2.0
 dials        1.2.0      tune         1.1.2
 infer        1.0.5      workflows    1.1.3
 modeldata    1.2.0      workflowsets 1.0.1
 parsnip      1.1.1      yardstick    1.2.0
 recipes      1.0.8     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
 scales::discard() masks purrr::discard()
 dplyr::filter()   masks stats::filter()
 recipes::fixed()  masks stringr::fixed()
 dplyr::lag()      masks stats::lag()
 yardstick::spec() masks readr::spec()
 recipes::step()   masks stats::step()
 Dig deeper into tidy modeling with R at https://www.tmwr.org

Warning message:
“package ‘GGally’ was built under R version 4.3.2”
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Attaching package: ‘testthat’


The following object is masked from ‘package:rsample’:

    matches


The following object is masked from ‘package:dplyr’:

    matches


The following object is masked from ‘package:purrr’:

    is_null


The following objects are masked from ‘package:readr’:

    edition_get, local_edition


The following object is masked from ‘package:tidyr’:

    matches


Attaching package: ‘rlang’


The following objects are masked from ‘package:testthat’:

    is_false, is_null, is_true


The following objects are masked from ‘package:purrr’:

    %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
    flatten_raw, invoke, splice


Predicting credit card balance

No description has been provided for this image

Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif

In this worksheet we will work with a simulated data set that contains information we can use to create a model to predict customer credit card balance. A bank might use such information to predict which customers might be the most profitable to lend to (for example, customers who carry a balance but do not default).

Specifically, we wish to build a model to predict credit card balance (Balance column) based on income (Income column) and credit rating (Rating column).

We access this data set by reading it from an R data package that we loaded at the beginning of the worksheet, ISLR. Loading that package gives access to a variety of data sets, including the Credit data set that we will be working with. We will rename this data set credit_original to avoid confusion later in the worksheet.

In [2]:
credit_original <- Credit
credit_original
A data.frame: 400 × 12

 ID   Income  Limit Rating Cards  Age Education Gender Student Married Ethnicity Balance
<int>  <dbl>  <int>  <int> <int> <int>   <int>  <fct>   <fct>   <fct>    <fct>    <int>
  1   14.891  3606    283     2    34      11    Male      No     Yes Caucasian     333
  2  106.025  6645    483     3    82      15  Female     Yes     Yes     Asian     903
  3  104.593  7075    514     4    71      11    Male      No      No     Asian     580
⋮
398   57.872  4171    321     5    67      12  Female      No     Yes Caucasian     138
399   37.728  2525    192     1    44      13    Male      No     Yes Caucasian       0
400   18.701  5524    415     5    64       7  Female      No      No     Asian     966

Question 1.1
{points: 1}

Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable) and use the as_tibble function to convert it to a tibble (it is currently a base R data frame). Name the modified data frame credit (using a lowercase c).

Note: We could alternatively just leave these variables in and use our recipe formula below to specify our predictors and response. But for this worksheet, let's select the relevant columns first.

In [3]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
credit <- credit_original |>
        as_tibble() |>
        select(Balance, Income, Rating)
In [4]:
Grade cell: cell-9342aee7f4b97ddf Score: 1.0 / 1.0 (Top)
test_1.1()
Test passed 🎊
Test passed 🥳
Test passed 🥇
Test passed 😀
[1] "Success!"

Question 1.2
{points: 1}

Before we perform exploratory data analysis, we should create our training and testing data sets. First, split the credit data set. Use 60% of the data and set the variable we want to predict (Balance) as the strata argument. Assign your answer to an object called credit_split.

Assign your training data set to an object called credit_training and your testing data set to an object called credit_testing.
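Conceptually, initial_split() samples row indices at random (plus stratification, which this sketch omits). A minimal base-R illustration on a hypothetical toy data frame:

```r
# Minimal base-R sketch of a 60/40 train-test split (no stratification);
# the toy data frame here is made up for illustration.
set.seed(2000)
toy <- data.frame(balance = 1:100, income = rnorm(100))

n_train <- floor(0.60 * nrow(toy))       # 60% of rows go to training
train_idx <- sample(nrow(toy), n_train)  # random, non-repeating row indices

toy_training <- toy[train_idx, ]
toy_testing  <- toy[-train_idx, ]        # the remaining 40%

nrow(toy_training)  # 60
nrow(toy_testing)   # 40
```

Stratifying on the response (as strata = Balance does) additionally ensures both pieces cover the full range of the response.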

In [5]:
Student's answer(Top)
set.seed(2000)
# your code here
#fail() # No Answer - remove if you provide an answer

credit_split <- initial_split(credit, prop = 0.60, strata = Balance)
credit_testing <- testing(credit_split)
credit_training <- training(credit_split)
In [6]:
Grade cell: cell-c6bf91ef0c8f21b5 Score: 1.0 / 1.0 (Top)
test_1.2()
Test passed 😸
Test passed 🥇
Test passed 😸
Test passed 🎊
Test passed 🌈
Test passed 😸
Test passed 🎉
Test passed 🥳
[1] "Success!"

Question 1.3
{points: 1}

Using only the observations in the training data set, use the ggpairs() function from the GGally package to create a pairplot (also called a "scatter plot matrix") of all the columns we are interested in including in our model. Since we have not covered how to create these in the textbook, we have provided you with most of the code below; you just need to provide suitable options for the size of the plot.

The pairplot contains a scatter plot of each pair of columns in the lower-left corner, smoothed histograms of each individual column on the diagonal, and the correlation coefficient (a quantitative measure of the relation between two variables) in the upper-right corner.

Name the plot object credit_pairplot.
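As an aside, the coefficients ggpairs() prints in the upper panels are Pearson correlations, which base R computes directly with cor(). A tiny sketch on made-up vectors:

```r
# Pearson correlation with base R's cor(); x and y are made-up toy vectors.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)   # y is perfectly linear in x

r <- cor(x, y)  # Pearson correlation coefficient
r               # 1: a perfect positive linear relationship
```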

In [7]:
Student's answer(Top)
options(repr.plot.height = 12, repr.plot.width = 12)
 credit_pairplot <- credit_training |> 
     ggpairs(
         lower = list(continuous = wrap('points', alpha = 0.4)),
         diag = list(continuous = "barDiag")
     ) +
    theme(text = element_text(size = 20))

# your code here
#fail() # No Answer - remove if you provide an answer
credit_pairplot
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
No description has been provided for this image
In [8]:
Grade cell: cell-883edd273699e4b7 Score: 1.0 / 1.0 (Top)
test_1.3()
Test passed 🌈
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 1.4 Multiple Choice:
{points: 1}

Looking at the ggpairs plot above, which of the following statements is incorrect?

A. There is a strong positive relationship between the response variable (Balance) and the Rating predictor

B. There is a strong positive relationship between the two predictors (Income and Rating)

C. There is a strong positive relationship between the response variable (Balance) and the Income predictor

D. None of the above statements are incorrect

Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [9]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer1.4 <- "C"
In [10]:
Grade cell: cell-921cf1869c166f49 Score: 1.0 / 1.0 (Top)
test_1.4()
Test passed 🌈
Test passed 🌈
[1] "Success!"

Question 1.5
{points: 1}

Now that we have our training data, we will fit a linear regression model.

  • Create and assign your linear regression model specification to an object called lm_spec.
  • Create a recipe for the model. Assign your answer to an object called credit_recipe.
In [11]:
Student's answer(Top)
set.seed(2020) #DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer
lm_spec <- linear_reg()|>
            set_engine("lm")|>
            set_mode("regression")

credit_recipe <- recipe(Balance ~ Income + Rating, data = credit_training)

print(lm_spec)
print(credit_recipe)
Linear Regression Model Specification (regression)

Computational engine: lm 


── Recipe ──────────────────────────────────────────────────────────────────────


── Inputs 

Number of variables by role

outcome:   1
predictor: 2

In [12]:
Grade cell: cell-a647adab28a3dfb2 Score: 1.0 / 1.0 (Top)
test_1.5()
Test passed 🌈
Test passed 🌈
Test passed 🎊
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 1.6
{points: 1}

Now that we have our model specification and recipe, let's put them together in a workflow, and fit our simple linear regression model. Assign the fit to an object called credit_fit.
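For intuition about what the fit produces, note that the "lm" engine delegates to base R's stats::lm(). The following toy sketch (simulated, noise-free data with made-up coefficients) shows lm() recovering the generating intercept and slopes:

```r
# Simulate balance = -500 - 7*income + 4*rating with no noise
# (all numbers here are made up for illustration), then check that
# lm() recovers the generating coefficients.
toy <- data.frame(income = c(10, 20, 30, 40, 50),
                  rating = c(300, 500, 350, 450, 400))
toy$balance <- -500 - 7 * toy$income + 4 * toy$rating

fit <- lm(balance ~ income + rating, data = toy)
round(coef(fit), 3)  # (Intercept) -500, income -7, rating 4
```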

In [13]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer

credit_fit <- workflow() |>
            add_recipe(credit_recipe)|>
            add_model(lm_spec)|>
            fit(data = credit_training)
credit_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)       Income       Rating  
   -528.014       -7.583        3.937  
In [14]:
Grade cell: cell-4664079ebe7d0892 Score: 1.0 / 1.0 (Top)
test_1.6()
Test passed 🌈
Test passed 🌈
Test passed 🎊
[1] "Success!"

Question 1.7 Multiple Choice:
{points: 1}

Looking at the slopes/coefficients above from each of the predictors, which of the following mathematical equations is correct for your prediction model?

A. $credit\: card \: balance = -528.014 -7.583*income + 3.937*credit\: card\: rating$

B. $credit\: card \: balance = -528.014 + 3.937*income -7.583*credit\: card\: rating$

C. $credit\: card \: balance = 528.014 -7.583*income - 3.937*credit\: card\: rating$

D. $credit\: card \: balance = 528.014 - 3.937*income + 7.583*credit\: card\: rating$

Assign your answer to an object called answer1.7. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [15]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer1.7 <- "A"
In [16]:
Grade cell: cell-7cb05955d5df0d29 Score: 1.0 / 1.0 (Top)
test_1.7()
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 1.8
{points: 1}

Calculate the $RMSE$ to assess goodness of fit on credit_fit (remember this is how well it predicts on the training data used to fit the model). Return a single numerical value named lm_rmse.
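For reference, RMSE is just the square root of the mean squared difference between observed and predicted values; a base-R sketch on hypothetical truth/prediction vectors:

```r
# RMSE computed by hand; truth and pred are made-up toy vectors.
truth <- c(100, 200, 300, 400)
pred  <- c(110, 190, 310, 390)   # each prediction is off by 10

rmse <- sqrt(mean((truth - pred)^2))
rmse  # 10
```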

In [17]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

lm_rmse <- credit_fit |>
        predict(credit_training) |>
        bind_cols(credit_training) |>
        metrics(truth = Balance, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()

# your code here
#fail() # No Answer - remove if you provide an answer
lm_rmse
167.317944534607
In [18]:
Grade cell: cell-8de81bb18dedbb48 Score: 1.0 / 1.0 (Top)
test_1.8()
Test passed 🌈
[1] "Success!"

Question 1.9
{points: 1}

Calculate $RMSPE$ using the test data. Return a single numerical value named lm_rmspe.

In [19]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer
lm_rmspe <- credit_fit |>
        predict(credit_testing) |>
        bind_cols(credit_testing) |>
        metrics(truth = Balance, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()


lm_rmspe
154.838964937623
In [20]:
Grade cell: cell-41031aad5e75b436 Score: 1.0 / 1.0 (Top)
test_1.9()
Test passed 🌈
[1] "Success!"

Question 1.9.1
{points: 3}

Redo this analysis using $k$-nn regression instead of linear regression. Use set.seed(2000) at the beginning of this code cell to make it reproducible. Use the same predictors and train-test data splits as you used for linear regression, and use 5-fold cross-validation to choose $k$ from the range 1-10. Remember to scale and shift your predictors on your training data, and to apply that same standardization to your test data! Assign a single numeric value for $RMSPE$ for your $k$-nn model as your answer, and name it knn_rmspe.
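The crucial detail in the standardization step is that the centering and scaling constants are learned from the training data only and then reused on the test data (which is what a recipe with step_center() and step_scale() does once it is prepped on the training set). A base-R sketch with hypothetical income values:

```r
# Standardize with training-set statistics only; both vectors are made up.
train_income <- c(20, 40, 60, 80)
test_income  <- c(30, 50)

mu    <- mean(train_income)  # learned from training data only
sigma <- sd(train_income)

train_std <- (train_income - mu) / sigma
test_std  <- (test_income - mu) / sigma  # same mu and sigma reused

round(mean(train_std), 10)  # 0: training data is centered
round(sd(train_std), 10)    # 1: and scaled to unit spread
```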

In [21]:
Student's answer Score: 3.0 / 3.0 (Top)
set.seed(2000) # DO NOT REMOVE

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("regression")

credit_recipe_2 <- recipe(Balance ~ Income + Rating, data = credit_training) |>
            step_scale(all_predictors()) |>
            step_center(all_predictors())

credit_vfold <- vfold_cv(credit_training, v = 5, strata = Balance)

credit_workflow <- workflow() |>
                add_recipe(credit_recipe_2) |>
                add_model(knn_tune)

gridvals <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

knn_results <- credit_workflow |>
            tune_grid(resamples = credit_vfold, grid = gridvals) |>
            collect_metrics() |>
            filter(.metric == "rmse") |>
            slice_min(mean, n = 1)

# The cross-validation results tell us to use 4 neighbours for our k-nn model

knn_neighbor <- nearest_neighbor(weight_func = "rectangular", neighbors = 4) |>
            set_engine("kknn") |>
            set_mode("regression")

knn_workflow <- workflow() |>
            add_recipe(credit_recipe_2) |>
            add_model(knn_neighbor) |>
            fit(data = credit_training)

knn_rmspe <- knn_workflow |>
        predict(credit_testing) |>
        bind_cols(credit_testing) |>
        metrics(truth = Balance, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()

# your code here
#fail() # No Answer - remove if you provide an answer
knn_rmspe
174.68803938016
In [ ]:

Question 1.9.2
{points: 3}

Discuss which model, linear regression versus $k$-nn regression, gives better predictions and why you think that might be happening.

Student's answer Score: 3.0 / 3.0 (Top)

Linear regression gives better predictions than k-nn regression here. I think this is because of the type of data we have. When we first looked at our data in Question 1.3, we could see that our variables (Income and Rating) show roughly linear relationships, in the sense that the trends in the data look like straight lines. When data has this shape, linear regression typically performs better, because it makes predictions from a single line of best fit. K-nn regression, by contrast, is more flexible in the "curves" and shapes it can follow, so it does not fare as well here.

In short, this type of data is well suited to linear regression, hence it performs better than k-nn regression.
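The point above can be sketched in base R: on perfectly linear data, a nearest-neighbour prediction can only repeat a nearby training response, while the fitted line generalizes, most visibly when predicting beyond the training range. (The 1-nearest-neighbour rule below is hand-rolled for illustration; it is not the kknn engine used in the worksheet.)

```r
# Compare lm() with a hand-rolled 1-nearest-neighbour prediction
# on perfectly linear toy data (all values made up for illustration).
train_x <- 1:10
train_y <- 2 * train_x   # response is exactly 2 * x

fit <- lm(train_y ~ train_x)
new_x <- 15              # outside the training range

lm_pred <- unname(predict(fit, data.frame(train_x = new_x)))  # 30
nn_pred <- train_y[which.min(abs(train_x - new_x))]           # 20 (nearest x is 10)

c(lm = lm_pred, nn = nn_pred)
```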

2. Ames Housing Prices

No description has been provided for this image

Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif

If we take a look at the Business Insider report What do millennials want in a home?, we can see that millennials like newer houses that have their own defined spaces. Today we are going to be looking at housing data to understand how the sale price of a house is determined. Finding highly detailed housing data with final sale prices is very hard; however, researchers from Truman State University have studied and made available a data set containing multiple variables for the city of Ames, Iowa. The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set here. Today we will be looking at 5 different variables to predict the sale price of a house. These variables are:

  • Lot Area: lot_area
  • Year Built: year_built
  • Basement Square Footage: bsmt_sf
  • First Floor Square Footage: first_sf
  • Second Floor Square Footage: second_sf

First, load the data with the script given below.

In [22]:
# run this cell

ames_data <- read_csv('data/ames.csv', col_types = cols()) |>
    select(lot_area = Lot.Area, 
           year_built = Year.Built, 
           bsmt_sf = Total.Bsmt.SF, 
           first_sf = `X1st.Flr.SF`, 
           second_sf = `X2nd.Flr.SF`, 
           sale_price = SalePrice) |>
    filter(!is.na(bsmt_sf))

ames_data
A tibble: 2929 × 6

lot_area year_built bsmt_sf first_sf second_sf sale_price
   <dbl>      <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
   31770       1960    1080     1656         0     215000
   11622       1961     882      896         0     105000
   14267       1958    1329     1329         0     172000
⋮
   10441       1992     912      970         0     132000
   10010       1974    1389     1389         0     170000
    9627       1993     996      996      1004     188000

Question 2.1
{points: 3}

Split the data into a train dataset and a test dataset, based on a 70%-30% train-test split. Use set.seed(2019). Remember that we want to predict the sale_price based on all of the other variables.

Assign the objects to ames_split, ames_training, and ames_testing, respectively.

Use 2019 as your seed for the split.

In [23]:
Student's answer(Top)
set.seed(2019) # DO NOT CHANGE!
# your code here
#fail() # No Answer - remove if you provide an answer




ames_split <- initial_split(ames_data, prop = 0.7, strata = sale_price)
ames_testing <- testing(ames_split)
ames_training <- training(ames_split)
In [24]:
Grade cell: cell-416374a3ce562c44 Score: 3.0 / 3.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create objects named ames_split, ames_training and ames_testing', {
    expect_true(exists("ames_split")) 
    expect_true(exists("ames_training")) 
    expect_true(exists("ames_testing"))  
    })
### BEGIN HIDDEN TESTS
test_that('ames_split should be a rsplit object.', {
    expect_true('rsplit' %in% class(ames_split))
    })
test_that('ames_training is not a tibble.', {
    expect_true('tbl' %in% class(ames_training))
    })
test_that('ames_training does not contain the correct number of rows and/or columns.', {
    expect_equal(dim(ames_training), c(2048, 6))
    expect_equal(digest(int_round(sum(ames_training$lot_area), 2)), '0f473284653f451d0cb5cea966f4fc14')
    expect_equal(digest(int_round(sum(ames_training$first_sf), 2)), '46b1007aee0c4135004ad8294c03f50d')
    })
test_that('ames_testing is not a tibble.', {
    expect_true('tbl' %in% class(ames_testing))
    })
test_that('ames_testing does not contain the correct number of rows and/or columns.', {
    expect_equal(dim(ames_testing), c(881, 6))
    expect_equal(digest(int_round(sum(ames_testing$lot_area), 2)), 'ef74702fa3efc82f4d97a2cd2eda7ef1')
    expect_equal(digest(int_round(sum(ames_testing$first_sf), 2)), '2b626f5c6c11e63c59ef70157245a8ec')
    })
print("Success!")
### END HIDDEN TESTS
Test passed 🎊
Test passed 🌈
Test passed 🎊
Test passed 😀
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 2.2
{points: 3}

Let's start by exploring the training data. Use the ggpairs() function from the GGally package to explore the relationships between the different variables.

Assign your plot object to a variable named answer2.2.

In [25]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

options(repr.plot.height = 12, repr.plot.width = 12)
 answer2.2 <- ames_training |> 
     ggpairs(
         lower = list(continuous = wrap('points', alpha = 0.4)),
         diag = list(continuous = "barDiag")
     ) +
    theme(text = element_text(size = 20))



# your code here
#fail() # No Answer - remove if you provide an answer
answer2.2
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
No description has been provided for this image
In [26]:
Grade cell: cell-ee3115b616837197 Score: 3.0 / 3.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create a plot named answer2.2', {
    expect_true(exists("answer2.2")) 
})

### BEGIN HIDDEN TESTS
test_that('answer2.2 should be using data from ames_training', {
    expect_equal(int_round(nrow(answer2.2$data), 0), 2048)
    expect_equal(int_round(ncol(answer2.2$data), 0), 6)
})
test_that('answer2.2 should be a pairwise plot matrix.', {
    expect_true('ggmatrix' %in% c(class(answer2.2)))
    })
print("Success!")
### END HIDDEN TESTS
Test passed 🌈
Test passed 🌈
Test passed 🎊
[1] "Success!"

Question 2.3 Multiple Choice:
{points: 1}

Now that we have seen all the relationships between the variables, which of the following variables would not be a strong predictor for sale_price?

A. bsmt_sf

B. year_built

C. first_sf

D. lot_area

E. second_sf

F. It isn't clear from these plots

Assign your answer to an object called answer2.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [27]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.3 <- "D"
In [28]:
Grade cell: cell-020aa6e5f8a70372 Score: 0.0 / 1.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object called answer2.3', {
    expect_true(exists('answer2.3'))
})
### BEGIN HIDDEN TESTS
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.3), 'f76b651ab8fcb8d470f79550bf2af53a')
    })
print("Success!")
### END HIDDEN TESTS
Test passed 🎉
── Failure: Solution is incorrect ──────────────────────────────────────────────
digest(answer2.3) not equal to "f76b651ab8fcb8d470f79550bf2af53a".
1/1 mismatches
x[1]: "c1f86f7430df7ddb256980ea6a3b57a4"
y[1]: "f76b651ab8fcb8d470f79550bf2af53a"

Error:
! Test failed
Traceback:

1. test_that("Solution is incorrect", {
 .     expect_equal(digest(answer2.3), "f76b651ab8fcb8d470f79550bf2af53a")
 . })
2. (function (envir) 
 . {
 .     handlers <- get_handlers(envir)
 .     errors <- list()
 .     for (handler in handlers) {
 .         tryCatch(eval(handler$expr, handler$envir), error = function(e) {
 .             errors[[length(errors) + 1]] <<- e
 .         })
 .     }
 .     attr(envir, "withr_handlers") <- NULL
 .     for (error in errors) {
 .         stop(error)
 .     }
 . })(<environment>)

Question 2.4 - Linear Regression
{points: 3}

Fit a linear regression model using tidymodels with ames_training using all the variables in the data set.

  • create a model specification called lm_spec
  • create a recipe called ames_recipe
  • create a workflow with your model spec and recipe, and then create the model fit and name it ames_fit
In [29]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer
lm_spec <- linear_reg() |>
            set_engine("lm") |>
            set_mode("regression")

ames_recipe <- recipe(sale_price ~ ., data = ames_training)

ames_fit <- workflow() |>
            add_recipe(ames_recipe) |>
            add_model(lm_spec) |>
            fit(data = ames_training)

ames_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
(Intercept)     lot_area   year_built      bsmt_sf     first_sf    second_sf  
 -1.750e+06    4.576e-01    8.944e+02    3.868e+01    8.274e+01    7.631e+01  
In [30]:
Grade cell: cell-664d89562f972a45 Score: 3.0 / 3.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named lm_spec', {
    expect_true(exists("lm_spec")) 
    })
test_that('Did not create an object named ames_recipe', {
    expect_true(exists("ames_recipe")) 
    })
test_that('Did not create an object named ames_fit', {
    expect_true(exists("ames_fit")) 
    })

### BEGIN HIDDEN TESTS
test_that('lm_spec is not a linear regression model', {
    expect_true('linear_reg' %in% class(lm_spec))
    })
test_that('lm_spec does not contain the correct specifications', {
    expect_equal(digest(as.character(lm_spec$mode)), 'b8bdd7015e0d1c6037512fd1396aef1a')
    expect_equal(digest(as.character(lm_spec$engine)), '0995419f6f003f701c545d050292f42d')
    })
test_that('ames_recipe is not a recipe', {
    expect_true('recipe' %in% class(ames_recipe))
    })
test_that('ames_recipe does not contain the correct variables', {
    expect_equal(digest(int_round(sum(ames_recipe$template$lot_area), 2)), '0f473284653f451d0cb5cea966f4fc14')
    expect_equal(digest(int_round(sum(ames_recipe$template$first_sf), 2)), '46b1007aee0c4135004ad8294c03f50d')
    })
test_that('ames_fit is not a workflow', {
    expect_true('workflow' %in% class(ames_fit))
    })
test_that('ames_fit does not contain the correct data', {
    expect_equal(digest(int_round(sum(ames_fit$pre$actions$recipe$recipe$template$lot_area), 2)), '0f473284653f451d0cb5cea966f4fc14')
    expect_equal(digest(int_round(sum(ames_fit$pre$actions$recipe$recipe$template$first_sf), 2)), '46b1007aee0c4135004ad8294c03f50d')
    })
test_that('ames_fit coefficients are incorrect', {
    expect_equal(digest(int_round(sum(ames_fit$fit$fit$fit$coefficients), 2)), '55cbb26ef620a305697488cc4877eaec')
    })
print("Success!")
### END HIDDEN TESTS
Test passed 🌈
Test passed 🌈
Test passed 🎊
Test passed 🎉
Test passed 😀
Test passed 😀
Test passed 🌈
Test passed 🥳
Test passed 🎉
Test passed 😀
[1] "Success!"

Question 2.5 True or False:
{points: 1}

Aside from the intercept, all the variables have a positive relationship with the sale_price. This can be interpreted as: as the values of the variables decrease, the prices of the houses increase.

Assign your answer to an object called answer2.5. Make sure your answer is in lowercase letters and is surrounded by quotation marks (e.g. "true" or "false").

In [31]:
# run this cell
ames_fit$fit$fit$fit$coefficients
(Intercept)   -1749701.4088015
lot_area      0.457633690493438
year_built    894.414231945236
bsmt_sf       38.6784110835774
first_sf      82.7399318778952
second_sf     76.3082733788392
In [32]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.5 <- "true"
In [33]:
Grade cell: cell-d02d466ab600f590 Score: 0.0 / 1.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.5', {
    expect_true(exists("answer2.5")) 
    })
### BEGIN HIDDEN TESTS
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.5), 'd2a90307aac5ae8d0ef58e2fe730d38b')
})
print("Success!")
### END HIDDEN TESTS
Test passed 🥇
── Failure: Solution is incorrect ──────────────────────────────────────────────
digest(answer2.5) not equal to "d2a90307aac5ae8d0ef58e2fe730d38b".
1/1 mismatches
x[1]: "05ca18b596514af73f6880309a21b5dd"
y[1]: "d2a90307aac5ae8d0ef58e2fe730d38b"

Error:
! Test failed
Traceback:

1. test_that("Solution is incorrect", {
 .     expect_equal(digest(answer2.5), "d2a90307aac5ae8d0ef58e2fe730d38b")
 . })
2. (function (envir) 
 . {
 .     handlers <- get_handlers(envir)
 .     errors <- list()
 .     for (handler in handlers) {
 .         tryCatch(eval(handler$expr, handler$envir), error = function(e) {
 .             errors[[length(errors) + 1]] <<- e
 .         })
 .     }
 .     attr(envir, "withr_handlers") <- NULL
 .     for (error in errors) {
 .         stop(error)
 .     }
 . })(<environment>)

Question 2.6
{points: 3}

Looking at the coefficients and intercept produced from the cell block above, write down the equation for the linear model.

Make sure to use correct math typesetting syntax (surround your answer with dollar signs, e.g. $0.5 * a$)

Student's answer Score: 3.0 / 3.0 (Top)

Predicted Sale Price = $-1749701$ + $lot\_area*0.457633690493438$ + $year\_built*894.414231945236$ + $bsmt\_sf*38.6784110835774$ + $first\_sf*82.7399318778952$ + $second\_sf*76.3082733788392$

Question 2.7 Multiple Choice:
{points: 1}

Why can we not easily visualize the model above as a line or a plane in a single plot?

A. This is not true, we can actually easily visualize the model

B. The intercept is much larger (6 digits) than the coefficients (single/double digits)

C. There are more than 2 predictors

D. None of the above

Assign your answer to an object called answer2.7. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [34]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.7 <- "C"
In [35]:
Grade cell: cell-e9fe30e9345df159 Score: 1.0 / 1.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.7', {
    expect_true(exists("answer2.7")) 
    })
### BEGIN HIDDEN TESTS
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.7), '475bf9280aab63a82af60791302736f6')
})
print("Success!")
### END HIDDEN TESTS
Test passed 🥳
Test passed 🥳
[1] "Success!"

Question 2.8
{points: 3}

We need to evaluate how well our model is doing. For this question, calculate the $RMSPE$ (a single numerical value) of the linear regression model using the test data set and assign it to an object named ames_rmspe.

In [36]:
Student's answer(Top)
set.seed(2020) # DO NOT REMOVE

# your code here
#fail() # No Answer - remove if you provide an answer


ames_rmspe <- ames_fit |>
        predict(ames_testing) |>
        bind_cols(ames_testing) |>
        metrics(truth = sale_price, estimate = .pred) |>
        filter(.metric == "rmse") |>
        select(.estimate) |>
        pull()

ames_rmspe
42698.6978213656
In [37]:
Grade cell: cell-83731b933e194459 Score: 3.0 / 3.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named ames_rmspe', {
    expect_true(exists("ames_rmspe")) 
    })
### BEGIN HIDDEN TESTS
test_that('ames_rmspe is incorrect', {
    expect_equal(digest(int_round(ames_rmspe, 2)), '449c6dc6cc4df30b73051b58cabea411')
})
print("Success!")
### END HIDDEN TESTS
Test passed 🌈
Test passed 🌈
[1] "Success!"

Question 2.9 Multiple Choice:
{points: 1}

Which of the following statements is incorrect?

A. $RMSE$ is a measure of goodness of fit

B. $RMSE$ measures how well the model predicts on data it was trained with

C. $RMSPE$ measures how well the model predicts on data it was not trained with

D. $RMSPE$ measures how well the model predicts on data it was trained with

Assign your answer to an object called answer2.9. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [38]:
Student's answer(Top)
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.9 <- "D"
In [39]:
Grade cell: cell-3831c36308e7c582 Score: 1.0 / 1.0 (Top)
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.9', {
    expect_true(exists("answer2.9")) 
    })
### BEGIN HIDDEN TESTS
test_that('Solution is incorrect', {
    expect_equal(digest(answer2.9), 'c1f86f7430df7ddb256980ea6a3b57a4')
})
print("Success!")
### END HIDDEN TESTS
Test passed 🎊
Test passed 🎉
[1] "Success!"
In [40]:
source("cleanup.R")
In [ ]: